Add comparison support for Union arrays#8838
Conversation
0ceac84 to
87c792b
Compare
alamb
left a comment
There was a problem hiding this comment.
Thanks @friendlymatthew -- this looks good to me. The only thing I think it needs are some tests of the error cases -- namely
- Compare an out of bounds index (expect panic)
- Try to compare incompatible union types
I also left some other small suggestions but I don't think they are needed
| let left = left.as_union(); | ||
| let right = right.as_union(); | ||
|
|
||
| let (left_fields, left_mode) = match left.data_type() { |
There was a problem hiding this comment.
This is weird to have to re-check the DataTypes.
What would you think about adding UnionArray::fields() and UnionArray::mode() methods to make the code easier to work with?
There was a problem hiding this comment.
This should be super quick to review: #8884
Somewhat related but it feels a bit weird that the following works without any notice to the user:
#[test]
fn test_union_fields() {
let ids = vec![0, 1, 2];
let field = Field::new("a", DataType::Binary, true);
// different length of ids and fields (we zip so we truncate the longer vec)
let _out = UnionFields::new(ids.clone(), vec![field.clone()]);
// duplicate fields associated with different type ids!
let _out = UnionFields::new(ids, vec![field.clone(), field]);
}I feel like we could benefit from a bit more validation? We could leave UnionFields::new but also have a UnionFields::try_new that checks the above 🤔
There was a problem hiding this comment.
Yes, I think that sounds like a good idea to me
We can even deprecate UnionFields::new to help people migrate over
There was a problem hiding this comment.
Here is another minor convenience improvement: #8895
arrow-ord/src/ord.rs
Outdated
|
|
||
| if left_fields != right_fields || left_mode != right_mode { | ||
| return Err(ArrowError::InvalidArgumentError( | ||
| "Cannot compare UnionArrays with different fields or modes".to_string(), |
There was a problem hiding this comment.
I recommend adding more details to this message to help when people hit it -- specifically, I recommend
- a separate message for different modes (and include the modes in the error message)
- Add the fields (
{fields:?}style) to the message
|
|
||
| let c_opts = child_opts(opts); | ||
|
|
||
| let mut field_comparators = HashMap::with_capacity(left_fields.len()); |
There was a problem hiding this comment.
rather than a hash map you could potentially just use a 128 valued Vec<> indexed by the typeids -- since typeid is i8 you know there can be at most 128 values that might be faster to lookup than hashing/hash table
There was a problem hiding this comment.
Hm so this was my first thought/approach as well, but I decided to use a hashmap because it avoids superfluous memory usage for sparse sets
Plus, I don't think this is a very hot path, so any perf differences wouldn't be super meaningful
779ea23 to
ad9027b
Compare
alamb
left a comment
There was a problem hiding this comment.
Thank you @friendlymatthew
# Which issue does this PR close? This PR adds another method on the `UnionArray` api that returns a list of `FieldRef`s associated with the union type See: #8838 (comment)
Which issue does this PR close?
cmpkernel #8837Uniondata types for row format #8828Rationale for this change
This PR implements comparison functionality for Union arrays. This implementation follows a simple ordering strategy where unions are first compared by their type identifier, and only when type identifiers match are the actual values within those types compared
This approach handles both sparse and dense union modes correctly by using offsets when present (dense unions) or direct indices (sparse unions) to locate the appropriate child array values